feat: zarr3 #220

floriankrb · 2025-02-26T16:46:45Z

Description

Anemoi-datasets should be agnostic to the version of zarr. It should run with zarr3 installed or with zarr2 installed.

We should also take into account that zarr2 code cannot read datasets created by zarr3. So, ideally, we should:
1 - update anemoi-datasets to work with both zarr2 and zarr3
2 - keep using zarr2 to build datasets (dependency of anemoi-datasets[create] on zarr<=2), but allow zarr2 or zarr3 when reading (dependency of anemoi-datasets on zarr)
3 - once users have updated their environments (6 months?), start building zarr3 datasets

This PR mostly addresses the first point, making anemoi-datasets detect the version of zarr and adapt to it.
What is still missing is:

- Some work on src/anemoi/datasets/data/stores.py: we had performance and crashing issues when reading from S3 buckets, which are difficult to reproduce or benchmark. Moreover, the interface for custom stores has changed between zarr2 and zarr3, and the feature we used is not available (yet?).
- zarr3 lacks some support for datetime data, a feature that was present in zarr2 and that we use.
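
As a rough illustration of the first point, version detection can be reduced to checking the major component of the installed zarr version. The sketch below is not the code from this PR: `zarr_major`, `backend_module_name` and `installed_backend` are hypothetical helpers, and the returned names only mirror the `zarr_versions/zarr2.py` and `zarr_versions/zarr3.py` files touched here.

```python
from importlib import metadata


def zarr_major(version: str) -> int:
    """Return the major component of a version string such as '2.18.3'."""
    return int(version.split(".", 1)[0])


def backend_module_name(version: str) -> str:
    """Map a zarr version string to a (hypothetical) backend module name."""
    major = zarr_major(version)
    if major not in (2, 3):
        raise ImportError(f"Unsupported zarr version: {version}")
    return f"zarr{major}"


def installed_backend() -> str:
    """Look up the installed zarr distribution and pick the matching backend."""
    try:
        return backend_module_name(metadata.version("zarr"))
    except metadata.PackageNotFoundError as exc:
        raise ImportError("zarr is not installed") from exc
```

Keeping the parsing separate from the lookup means the mapping can be tested without either zarr version installed.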

📚 Documentation preview 📚: https://anemoi-datasets--220.org.readthedocs.build/en/220/

codecov-commenter · 2025-02-27T09:37:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.85%. Comparing base (80de4c6) to head (1459e83).
Report is 62 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
+ Coverage   72.96%   73.85%   +0.88%     
==========================================
  Files          10       10              
  Lines         825      872      +47     
==========================================
+ Hits          602      644      +42     
- Misses        223      228       +5


anaprietonem · 2025-06-03T11:25:06Z

@floriankrb what was the conclusion in terms of adding/updating to zarr 3?

floriankrb · 2025-06-04T13:17:33Z

@anaprietonem I updated the description of the PR


frazane · 2025-06-25T12:29:49Z

Just adding a small comment here. You've probably considered this already but here we go.

The transition to Zarr v3 specification comes with a nice opportunity: we could use the sharding feature! It would allow us to have chunked variables (avoiding having to load them all in memory before taking a subset) while still retaining very high read speeds. Basically we would make each timestamp a shard, and then chunk variables inside the shard. The downside is slower write speed, but it’s not very important.

More info:

Some quick benchmarking of the sharded format zarr-developers/zarr-python#1338
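
To make the proposed layout concrete, here is a minimal sketch of the index arithmetic, assuming one shard per timestamp and chunks that split only the variable axis. The function `locate` and the parameter `variables_per_chunk` are hypothetical names used for illustration; nothing here calls the zarr API.

```python
def locate(date_index: int, variable_index: int, variables_per_chunk: int) -> tuple[int, int]:
    """Return (shard index, chunk index inside that shard) for one field.

    Assumes one shard per timestamp and chunks that split only the
    variable axis, as suggested in the comment above.
    """
    if variables_per_chunk < 1:
        raise ValueError("variables_per_chunk must be at least 1")
    shard = date_index  # one shard per timestamp
    chunk = variable_index // variables_per_chunk  # position inside the shard
    return shard, chunk
```

With this layout, reading a few variables at one date touches a single shard and only the chunks that hold those variables.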

b8raoult · 2025-06-26T10:26:35Z

All tests are now passing with zarr2 and zarr3. At the time of writing, zarr3 still does not support datetime64 (zarr-developers/zarr-python#2616).

Tests are much slower with zarr3; this will require more investigation. For example, test_slice_4 takes 98s with zarr2 (about 1.6 minutes) and 612s with zarr3 (about 10 minutes).

The profiler shows for zarr2:

   773147    0.339    0.000   95.489    0.000 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    1.553    0.000   95.152    0.000 .../python3.12/site-packages/zarr/core.py:656(__getitem__)

and for zarr3:

   773147    0.449    0.000  615.551    0.001 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    2.088    0.000  615.118    0.001 .../python3.12/site-packages/zarr/core/array.py:2294(__getitem__)

The test uses an in-memory zarr store and calls __getitem__ about 770k times. The profiler rounds the time per call to 0.001s; 0.001 x 770000 is 770, which is in the ballpark of 615 (the unrounded figure is about 0.0008s per call).
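
The per-call cost can be recomputed from the profiler figures quoted above. This is only a small arithmetic sketch; the variable names are illustrative.

```python
# Figures copied from the two profiler outputs above
# (call count and cumulative time of stores.py:148 __getitem__).
calls = 773_147
cumulative_s = {"zarr2": 95.489, "zarr3": 615.551}

# Average wall-clock time per __getitem__ call, in milliseconds.
per_call_ms = {name: 1000 * total / calls for name, total in cumulative_s.items()}
slowdown = cumulative_s["zarr3"] / cumulative_s["zarr2"]

for name, ms in per_call_ms.items():
    print(f"{name}: {ms:.3f} ms per call")
print(f"zarr3 / zarr2: {slowdown:.1f}x")
```

This gives roughly 0.12 ms per call with zarr2 and 0.80 ms with zarr3, a slowdown of about 6.4x.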

tjhunter

@floriankrb @b8raoult this is fantastic, thank you for the hard work! We will try it this week or next and get back to you.

tjhunter · 2025-06-27T04:50:45Z

src/anemoi/datasets/zarr_versions/zarr2.py

+
+    def __delitem__(self, key: str) -> None:
+        """Prevent deletion of items."""
+        raise NotImplementedError()


This operation is illegal; it is not a "not implemented" error.
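
One possible reading of this comment, sketched under assumptions: raise an exception that says the store is read-only rather than that the method is unfinished. `ReadOnlyStoreError` and `ReadOnlyStore` are hypothetical names, and the choice of `TypeError` as a base class is only one option (it matches what Python's own immutable mappings raise).

```python
class ReadOnlyStoreError(TypeError):
    """Raised when a caller tries to modify a read-only store (hypothetical name)."""


class ReadOnlyStore:
    """Minimal read-only wrapper around a dict-like store (illustration only)."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = dict(data)

    def __getitem__(self, key: str) -> bytes:
        return self._data[key]

    def __setitem__(self, key: str, value: bytes) -> None:
        raise ReadOnlyStoreError(f"Cannot set {key!r}: store is read-only")

    def __delitem__(self, key: str) -> None:
        raise ReadOnlyStoreError(f"Cannot delete {key!r}: store is read-only")
```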

tjhunter · 2025-06-27T04:50:52Z

src/anemoi/datasets/zarr_versions/zarr2.py

+        """Prevent deletion of items."""
+        raise NotImplementedError()
+
+    def __setitem__(self, key: str, value: bytes) -> None:


tjhunter · 2025-06-27T04:52:12Z

src/anemoi/datasets/zarr_versions/zarr2.py

+    def __getitem__(self, key: str) -> bytes:
+        """Retrieve an item from the store and print debug information."""
+        # print()
+        print("GET", key, self)


I am putting linter rules in place in wegen to disallow all print statements.
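
A minimal sketch of the same debug output routed through the standard `logging` module instead of `print`. `LoggingStore` is a hypothetical stand-in for the store class in the hunk above, not the PR's implementation.

```python
import logging

LOG = logging.getLogger(__name__)


class LoggingStore:
    """Dict-backed store that logs each read at DEBUG level (illustration only)."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = dict(data)

    def __getitem__(self, key: str) -> bytes:
        # Lazy %-style formatting: the message is only built if DEBUG is enabled.
        LOG.debug("GET %s %r", key, self)
        return self._data[key]
```

The messages stay silent by default and can be enabled with `logging.basicConfig(level=logging.DEBUG)` when debugging.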

tjhunter · 2025-06-27T04:54:14Z

src/anemoi/datasets/zarr_versions/zarr2.py

+from typing import Any
+from typing import Optional
+
+import zarr


I think you could move this import into create_array and put a version check there (you already have it in zarr_2_or_3).

tjhunter · 2025-06-27T04:55:43Z

src/anemoi/datasets/zarr_versions/zarr3.py

+def create_array(zarr_root, *args, **kwargs):
+    if "compressor" in kwargs and kwargs["compressor"] is None:
+        # compressor is deprecated, use compressors instead
+        kwargs.pop("compressor")


Nit: no need to check whether the key is there. You can do kwargs.pop("x", None).
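
For illustration, `dict.pop` with a default never raises `KeyError`, so the membership test can go. `drop_compressor` is a hypothetical helper. Note that, unlike the hunk above, this sketch discards `compressor` whatever its value, not only when it is `None`.

```python
def drop_compressor(kwargs: dict) -> dict:
    """Return a copy of kwargs without the 'compressor' key, if present."""
    cleaned = dict(kwargs)
    cleaned.pop("compressor", None)  # no KeyError when the key is missing
    return cleaned
```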

tjhunter · 2025-06-27T04:56:53Z

src/anemoi/datasets/zarr_versions/zarr3.py

+    import numpy as np
+
+    if dtype == "datetime64[s]":
+        dtype = np.dtype("int64")


tjhunter · 2025-06-27T07:03:41Z

src/anemoi/datasets/create/__init__.py

+            else:
+                LOG.warning("⚠️" * 80)
+                LOG.warning(
+                    f"Only Zarr version 2 is supported when creating datasets, found version: {zarr.__version__}"


We should not support writing in zarr3 format for the time being, only reading. Is there a use case for it?

github-actions bot added the tests and enhancement labels Feb 26, 2025

feat: zarr3

07edf73

floriankrb force-pushed the feature/zarr3 branch from 8371f18 to 07edf73 Compare February 27, 2025 09:30

github-actions bot added the dependencies label Feb 27, 2025

zarr3

1459e83

floriankrb force-pushed the feature/zarr3 branch from 7c0f187 to 1459e83 Compare February 27, 2025 10:21

anaprietonem assigned floriankrb Feb 27, 2025

floriankrb marked this pull request as ready for review March 12, 2025 15:59

floriankrb requested review from theissenhelen, JesperDramsch, gmertes, b8raoult, anaprietonem, HCookie, JPXKQX and mchantry as code owners March 12, 2025 15:59

floriankrb marked this pull request as draft March 28, 2025 09:17

Merge branch 'main' into feature/zarr3

870b936

floriankrb mentioned this pull request Jun 23, 2025

Support for zarr 3 #290

Open

b8raoult and others added 2 commits June 24, 2025 12:43

Merge branch 'main' into feature/zarr3

da6995a

[pre-commit.ci] auto fixes from pre-commit.com hooks

e43f0f6

for more information, see https://pre-commit.ci

b8raoult marked this pull request as ready for review June 24, 2025 11:43

b8raoult requested a review from a team as a code owner June 24, 2025 11:43

tjhunter mentioned this pull request Jun 24, 2025

Time-bounded investigation: use experimental branch of anemoi-datasets with zarr3 features ecmwf/WeatherGenerator#384

Open

b8raoult added 2 commits June 25, 2025 10:47

update

59d3a9e

update

5c870dc

b8raoult added 3 commits June 25, 2025 11:30

update

f6c2683

update

a7a52e2

update

5ed4442

b8raoult and others added 3 commits June 26, 2025 09:55

add cli options

68cccb3

Merge branch 'main' into feature/zarr3

60f723e

fix doc

b844c42

github-actions bot added the documentation label Jun 26, 2025

b8raoult added 3 commits June 26, 2025 16:42

feat: add planetary planetary source

deca406

Merge branch 'feat/planetary-computer' into feature/zarr3

ec48d54

add tests

c72fe27

tjhunter reviewed Jun 27, 2025
